An Alternative to Correspondence Analysis Using Hellinger Distance
This document is best quality available. The copy furnished to DTIC contained a significant number of pages which do not reproduce legibly.
Author
Abstract
In this paper, a general theory of canonical coordinates is developed for reducing the dimensionality of multivariate data, assessing the resulting loss of information, and plotting higher-dimensional data in two or three dimensions for visual display. The theory is applied to data in two-way tables with variables in one category and samples (individuals or populations) in the other. The method is applicable to data with continuous measurements on the variables as well as to frequencies of attributes. An alternative to the usual correspondence analysis of contingency tables, based on the Hellinger distance rather than the chi-square distance, is suggested. The new method has some attractive features and does not suffer from some inherent drawbacks resulting from the use of the chi-square distance and variable sample sizes for the populations in correspondence analysis. The technique of biplots, where the populations and the variables are represented on the same chart, is discussed.
1. Canonical Coordinates
The concept of canonical variates (coordinates) was introduced in an early paper by the author (Rao (1948)) for graphical representation of taxonomical units characterized by multiple measurements. This was, perhaps, the first attempt to reduce high-dimensional data to two or three dimensions using an objective criterion for purposes of graphical display. Since then, graphical representation of multivariate data for visual examination of clusters, outliers and other structures in the data has been an active field of research. Some of the developments are biplots (Gabriel (1971), Gifi (1990), Nishisato (1980), Gower (1993), Greenacre (1993)), multidimensional scaling (Kruskal and Wish (1978)), correspondence analysis (Benzecri (1992), Greenacre (1984)), Chernoff's faces (Chernoff (1973)) and parallel coordinates (Mahalanobis, Mazumdar and Rao (1949), Wegman (1990)). Cavalli-Sforza (1991) uses canonical coordinates (variables) in interpreting the evolution of human populations.
The object of the present paper is to briefly review the concept of canonical coordinates, as originally introduced in 1948 and later elaborated in Rao (1964, 1979, 1980, 1985), in the light of modern developments, and to present an alternative to the current practice of correspondence analysis that seems to have some attractive properties. In Section 2 we consider the general problem of transforming the points of a p-dimensional vector space endowed with a specified inner product to a lower-dimensional Euclidean space with the usual definition of inner product and distance. The solution to the problem is considered in a more general setup than is possible through the use of the Eckart and Young (1936) theorem. In Section 3, some measures are introduced to assess the loss of information in reduction of dimensionality. The role of biplots and their interpretation are also discussed. An alternative to correspondence analysis applied to contingency tables, based on the Hellinger distance rather than the chi-square distance, is given in Section 4. It is argued that the chi-square distance used in correspondence analysis is not an intrinsic measure of the difference between two given population distributions, as it depends to some extent on the whole set of populations considered in the study, and also on the sample sizes available for the estimation of population distributions. In such a case, the configuration of a subset of the populations as revealed by correspondence analysis may depend on what other populations are included in the analysis. An example is given to show how anomalies can arise in correspondence analysis based on the chi-square distance. On the other hand, no such anomalies arise with the use of the Hellinger distance.
1991 Mathematics Subject Classification. 62H30, 62H17.
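The dependence described above can be illustrated with a minimal sketch. It assumes the usual definitions: the chi-square distance between two row profiles weights each squared difference by the inverse of the column mass computed from the whole table, while the Hellinger distance is the Euclidean distance between the square roots of the two profiles (some texts add a factor of 1/2, which is omitted here). The function names and the toy counts are hypothetical and are not taken from the paper.

```python
import math


def row_profiles(table):
    """Convert each row of counts into relative frequencies (a profile)."""
    return [[x / sum(row) for x in row] for row in table]


def column_masses(table):
    """Column totals divided by the grand total of the whole table."""
    total = sum(sum(row) for row in table)
    n_cols = len(table[0])
    return [sum(row[k] for row in table) / total for k in range(n_cols)]


def chi_square_distance(table, i, j):
    """Chi-square distance between rows i and j; uses masses of ALL rows."""
    p = row_profiles(table)
    c = column_masses(table)
    return math.sqrt(
        sum((a - b) ** 2 / ck for a, b, ck in zip(p[i], p[j], c))
    )


def hellinger_distance(table, i, j):
    """Hellinger distance between rows i and j; uses only those two rows."""
    p = row_profiles(table)
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p[i], p[j]))
    )


# Hypothetical counts: two populations measured on three attributes.
two_populations = [[30, 50, 20], [20, 40, 40]]
# The same two populations plus a third, much larger one.
three_populations = two_populations + [[290, 5, 5]]

for name, table in [("two", two_populations), ("three", three_populations)]:
    print(
        name,
        "chi-square:", round(chi_square_distance(table, 0, 1), 4),
        "Hellinger:", round(hellinger_distance(table, 0, 1), 4),
    )
```

In this sketch the chi-square distance between the first two populations changes once the third population is added, because the column masses change, whereas the Hellinger distance between them stays the same.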
Journal title:
Volume, issue
Pages -
Publication date: 1997